APPLICATIONS OF CLUSTER ANALYSIS 
TO SOME PROBLEMS OF INTEREST 
TO THE U. S. DEPARTMENT OF STATE 



James Richard Lamping









Monterey, California






APPLICATIONS OF CLUSTER ANALYSIS 
TO SOME PROBLEMS OF INTEREST 
TO THE U. S. DEPARTMENT OF STATE 



by 

James Richard Lamping

Thesis Adviser: B. Shubert



September 1973 



Approved for public release; distribution unlimited.




Applications of Cluster Analysis 
to Some Problems of Interest 
to the U. S. Department of State 

by 



James Richard Lamping
Lieutenant Commander, United States Navy 
B.S., University of Notre Dame, 1964 



Submitted in partial fulfillment of the 
requirements for the degree of 



MASTER OF SCIENCE IN OPERATIONS RESEARCH 

from the 



NAVAL POSTGRADUATE SCHOOL 
September 1973 






ABSTRACT 



This thesis asserts that Cluster Analysis, or Numerical 
Taxonomy, has many potential applications in the field of 
international relations. It demonstrates two representative 
applications. Both examples treat the nations of the world 
as objects having measurable attributes, and both examples 
use selected attributes to produce a dendrogram (or 
hierarchical classification) of the nations of the world. 

In one example this dendrogram is used to objectively group 
the nations into blocs based on external economic ties. In 
the other example the dendrogram is used to highlight interactions
among five attributes, ignoring the identity of
individual nations, the same way a scatter plot highlights
interactions between two variables.

I. BACKGROUND



A. CLUSTER ANALYSIS DEFINED 

The subject of this thesis is a group of mathematical 
techniques known collectively as either Cluster Analysis or 
Numerical Taxonomy. The terms are equivalent. Their formal
definition, paraphrased from Ref. 6, is actually a sequence
of definitions, as follows:

classification system - a set of subsets of a set of objects
which conveys some information about the objects

taxonomy - the science of constructing classificatory systems

Cluster Analysis or Numerical Taxonomy - the science of
constructing mathematical classificatory systems

In less formal terms, Cluster Analysis includes all mathematical
methods of classifying objects into sets so as to
represent complex data in a simpler way which will serve as
a fruitful source of hypotheses.

Cluster Analysis is a two-stage process. The first stage
is to choose quantifiable attributes that describe the objects,
and then use these attributes to measure the pair-wise dissimilarity
among the objects. The second stage is to represent
these dissimilarities by an appropriate classificatory system
or display.

The input to Cluster Analysis is normally an n x m matrix
of data: measurements of m attributes for each of n objects.
The output from Cluster Analysis is normally one of three
displays:

A hierarchical classification, commonly called a tree
diagram or dendrogram;

A partition of the objects into mutually exclusive sets,
each set described by a "profile" or vector of m average
attribute values;

A "clumping" of the objects into sets that may overlap, each
set again described by a profile.

The value of these outputs is that they summarize the original
data objectively and they tend to highlight subtle interactions
in the original data, enabling a user to formulate reasonable
hypotheses about these interactions.

B. PREVIOUS APPLICATIONS OF CLUSTER ANALYSIS

Cluster Analysis was developed in the eighteenth century
by botanists and biologists attempting to inject more objectivity
into their classifications of plant and animal specimens
(the familiar phylum-genus-species scheme). Subsequently the
same technique was used by geologists. Most recently, Cluster
Analysis has found numerous applications in the social sciences,
particularly in psychology. Reference 1 describes a
representative application.

II. PROPOSED APPLICATIONS OF CLUSTER ANALYSIS
IN THE STATE DEPARTMENT 



The United States State Department is currently trying
to revitalize its policy-making and resource-allocating
functions, a la the Defense Department metamorphosis under
Robert McNamara. This revitalization effort has been underway
for eight years now. In that time there has been published
a plethora of "master plans" for the incorporation of Systems
Analysis in the State Department (References 2 and 3 are
representative).

This paper does not propose another master plan, but
merely suggests that a single existing statistical analysis
technique has useful applications within the State Department.
The existing technique is Cluster Analysis, and the potential
applications within the State Department are described and
demonstrated in the pages that follow.

A. FIRST APPLICATION: TO HIGHLIGHT INTERACTIONS OF VARIABLES 

1. General Description

In this application, Cluster Analysis highlights the
interactions of several variables the same way a scatter plot
would for two variables. It inputs an n x m matrix of data
(n countries, each described by m variables) and outputs a
dendrogram. The dendrogram itself says nothing about interactions
among the variables. But it is a simple matter to
select a clustering level (where k = the number of clusters)
and plot the distribution of the m variables within each
cluster. Comparisons among these plots should bring out all
significant interactions among the variables. In particular,
they should highlight mutual interaction among three variables
or even among four variables just as easily as they highlight
a two-way interaction. This is a potential not shared by
factor analysis and regression techniques.

2. Scenario for Demonstration

The United States Constitution lists Freedom of the 
Press, Freedom of Speech, and Freedom of Assembly as inalien- 
able human rights. Although one might argue that these 
precise terms have been eclipsed by communications technology, 
most of the Western world would agree that "free and facile
communication among the people" is an essential quality in a
free and productive society. Having tentatively accepted
this thesis, a sociologist or political scientist might well
wish to dissect the concept of "free and facile communication
among the people," to define it in quantifiable terms. Moreover,
a policy planner in the State Department might well wish
to go one step further: to use this quantitative definition
in a comparative study of the countries of the world. Such
comparisons are made every day with respect to Gross National
Product, Life Expectancy, etc. Why not also tabulate an FFCAP
(free and facile communication among the people) Index?

Assuming that the State Department considered it
worthwhile to develop such an index, they would probably task
a team of their sociologists to propose a list of measurable
factors that either contribute to or detract from "free and
facile communication among the people." This team would
certainly appreciate the practical advantages of building
this list around statistics that had already been measured,
and using the existing data to continually validate their
theories against the real world.

Thus they would probably be faced, early on in their 
proceedings, with a large volume of existing data to be 
perused, or analyzed in a very general sense. At this point 
they could profit greatly from applying Cluster Analysis to 
highlight the interaction of variables. 

3. Choice of Data 

To demonstrate this application, the author has usurped
the role of State Department sociologist and selected the
following statistics as "measurable factors that either contribute
to or detract from free and facile communication among
the people":

Variable 1. Concentration of Population in Cities, 1965 

Variable 2. Radios per 1000 Population, 1965 

Variable 3. Students in Higher Education (Third Level) 
per One Million Population, 1965 

Variable 4. Ethno-Linguistic Fractionalization

Variable 5. Press Freedom Index, 1965

See Appendix A for definitions of these variables.

It is readily admitted that this list is not as
complete as it should be. In particular, "Literacy Rate" is
conspicuous by its absence, and some measure of newspaper
circulation seems a necessary counterpart to Variable 2.
The reason for such omissions was unavailability of data,
an affliction that is widespread among independent researchers
but not shared by insiders at the State Department.

The unavailability of data to this researcher imposed 
another artificiality on this demonstration besides the omission 
of some desirable variables. Table I displays values of the 
aforementioned variables for only 85 of the 136 nations in the 
world. It was necessary to delete the other nations because 
of excessive missing data. 

4. Choice of Dissimilarity Coefficient

This section describes the process of converting the 
data in Table I to a matrix of Dissimilarity Coefficients. 

The first decision point was to specify a formula for 
the Dissimilarity Coefficient (DC). The DC is a single real 
number specifying the amount of dissimilarity between Country
A and Country B, obtained by somehow combining the five data 
points describing each country. There are many different 
formulas for transforming these ten data points to a single 
DC. Cormack presents a concise but comprehensive summary of 
all the common formulas in Table 1 of Ref. 

In the situation at hand it was decided to use a 
Euclidean Distance, standardized by range. That is, 

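\[ DC(i,j) = \left[ \sum_{v=1}^{5} \left( \frac{x_{iv} - x_{jv}}{W_v} \right)^{2} \right]^{1/2} \]

where x_iv is the value of Variable v for country i, and W_v is the
range of Variable v over all countries.

TABLE I DATA INPUT TO HIGHLIGHTING DEMONSTRATION

Euclidean Distance was preferred to others simply
because of its geometric, intuitive appeal.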

On the other hand, a more elaborate rationale went
into the decision to standardize by range. First of all it
was decided that some type of standardizing (scaling) would
be appropriate. Most of the literature argues convincingly
that scaling is inappropriate when the difference in scale
between two variables may be intrinsic; but no such intrinsic
differences seemed likely in the five variables used here.
Moreover, using unstandardized Euclidean Distance in this
situation would clearly result in the DC being driven by
Variables 2 and 3 while Variables 1 and 4 would be virtually
ignored, and there is no a priori reason to intentionally
emphasize one variable over another in this application.

Having decided to use some type of scaling, there 
were many types to choose from, namely scaling by standard 
deviation, scaling by range, and scaling by some other 
heterogeneity measure (see page 326 of Ref. 4 for a comparative 
discussion). Since the data distributions were mixed there 
was no compelling theoretical reason for choosing one scaling 
method over another. Eventually, scaling by range was 
selected for its simplicity. It would be interesting to see 
if scaling by standard deviation would significantly change 
the end result (dendrogram) from that obtained here; but this 
was not done.
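The range-standardized Euclidean distance described in this section can be sketched as follows. This is only an illustrative sketch, not the program listed at the end of the thesis: the function name `range_scaled_dc`, the helper `pair_count`, and the three-variable toy data are all hypothetical, and the square-root form of the distance is an assumption.

```python
import math

def range_scaled_dc(data):
    """Return an n x n matrix of dissimilarities for n objects.

    data is a list of n rows, each holding m attribute values.
    Each attribute difference is divided by that attribute's range
    before the Euclidean distance is taken (assumed form).
    """
    n, m = len(data), len(data[0])
    # Range W_v of each variable across all objects (assumed nonzero).
    ranges = [max(row[v] for row in data) - min(row[v] for row in data)
              for v in range(m)]
    dc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum(((data[i][v] - data[j][v]) / ranges[v]) ** 2
                              for v in range(m)))
            dc[i][j] = dc[j][i] = d  # the matrix is symmetric
    return dc

def pair_count(n):
    """Number of distinct object pairs, i.e. entries above the diagonal."""
    return n * (n - 1) // 2

# Invented toy data: three objects, three variables on very different scales.
toy = [[0.05, 10.0, 200.0],
       [0.20, 500.0, 8000.0],
       [0.10, 1200.0, 28000.0]]
dc_toy = range_scaled_dc(toy)
```

Because every variable is divided by its own range, no single variable can contribute more than 1.0 to the sum under the square root, which is the equal-weighting property the text asks for.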

Using Euclidean Distance standardized by range, the
425 data points in Table I (85 countries x 5 variables) were
transformed into a matrix of 3570 Dissimilarity Coefficients
(one for each of the 85 x 84 / 2 pairs of countries).

5. Choice of Algorithm



There are several methods of proceeding from the matrix 
of Dissimilarity Coefficients (DC) to a partition (or a dendro- 
gram of partitions) of the countries. Choosing one was the 
next decision point in this demonstration. All methods in 
general use fall into one of three categories:

a. Agglomerative Algorithms - a series of successive 
fusions of the 85 countries into groups. 

b. Divisive Algorithms - a partitioning of the complete set 
of countries successively into finer partitions. 

c. Reallocative Algorithms - successive reallocation of 
individual countries between the sets of some 
initial partition. 

It was first decided that a reallocative algorithm would be
inappropriate because it requires an initial partition, and
there was no a priori evidence to suggest what that partition
should be. Between the two remaining alternatives, theoretical
considerations did not yield a preference: in nearly every
case, agglomerative and divisive algorithms produce identical
dendrograms. The agglomerative algorithm was selected because
its details have been more thoroughly documented in the
literature.

Within the family of agglomerative algorithms there
are at least eight documented alternative "sorting strategies"
or formulas for determining the DC between cluster (k) and
cluster (ij), the fusion of clusters (i) and (j), using the DC
between cluster (k) and cluster (i) and the DC between
cluster (k) and cluster (j). If the matrix
of Dissimilarity Coefficients contains natural and compelling
clusters, each having strong internal cohesion and strong
external isolation, then the choice of sorting strategy is
not a critical one. But if natural and compelling clusters
are not present, different sorting strategies can produce
markedly different dendrograms. The eight common sorting
strategies are explained in Chapter 3 of Ref. 4. Of those
eight, the Complete Linkage-Furthest Neighbor sorting strategy
and the Single Linkage-Nearest Neighbor sorting strategy
represent the extremes. The others may be thought of as
compromises between these two. The Complete Linkage-Furthest
Neighbor sorting strategy can be expressed mathematically as

DC(k,ij) = max( DC(k,i), DC(k,j) ) 

It produces compact clusters having high internal cohesion; 
but it may sacrifice external isolation when natural and 
compelling clusters are not intrinsic in the data. At the 
other extreme, the Single Linkage-Nearest Neighbor sorting 
strategy can be expressed mathematically as 

DC(k,ij) = min( DC(k,i), DC(k,j) )

It tends to produce chains of objects in addition to, or 
instead of, compact clusters, especially when natural and 
compelling clusters are not intrinsic in the data. In some 
applications this tendency is desirable. 
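The two update formulas above can be sketched in a naive agglomerative loop. This is a hedged illustration only, not the thesis program: the function name `agglomerate`, its `linkage` parameter, and the four-point toy data are hypothetical. Passing `max` gives the Complete Linkage-Furthest Neighbor update and passing `min` gives the Single Linkage-Nearest Neighbor update.

```python
def agglomerate(dc, linkage=max):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix.

    Returns a list of merges as (members_a, members_b, level) tuples,
    in the order the fusions occur.
    """
    # Each active cluster id maps to the tuple of objects it contains.
    clusters = {i: (i,) for i in range(len(dc))}
    # Working dissimilarities between active clusters, keyed (small_id, large_id).
    dist = {(i, j): dc[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dc)
    while len(clusters) > 1:
        # Fuse the pair of clusters with the smallest current dissimilarity.
        (a, b), level = min(dist.items(), key=lambda item: item[1])
        merges.append((clusters[a], clusters[b], level))
        for k in clusters:
            if k in (a, b):
                continue
            d_ka = dist[(min(k, a), max(k, a))]
            d_kb = dist[(min(k, b), max(k, b))]
            # DC(k,ij) = max(...) or min(...), depending on the strategy.
            dist[(k, next_id)] = linkage(d_ka, d_kb)
        # Drop every entry that still refers to the two fused clusters.
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Invented toy data: four objects placed on a line at 0, 1, 3 and 7.
positions = [0.0, 1.0, 3.0, 7.0]
toy_dc = [[abs(p - q) for q in positions] for p in positions]
complete_levels = [level for _, _, level in agglomerate(toy_dc, linkage=max)]
single_levels = [level for _, _, level in agglomerate(toy_dc, linkage=min)]
```

On this toy data both strategies fuse the objects in the same order, but the fusion levels differ: complete linkage fuses at the largest member-to-member dissimilarity, single linkage at the smallest.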

For the demonstration at hand compact clusters were
considered far more desirable than chains, and the Complete
Linkage-Furthest Neighbor sorting strategy was selected. It
would be interesting to see if one of the compromise sorting
strategies, such as Group Average, would significantly change
the end result (dendrogram) from that obtained here; but this
was not done.

The dendrogram in Drawing 1 was obtained using the
Complete Linkage-Furthest Neighbor sorting strategy in an
agglomerative algorithm. The computer program is listed at
the end of this thesis for information. It should be noted
that the matrix of Dissimilarity Coefficients was standardized
to the (0.0, 100.0) interval, using scaling by range, before
it was input to the clustering algorithm. But such
standardizing was done for computational convenience only.
Its single effect was a monotonic transformation of the
numerical scale across the top of the dendrogram. The shape
of the dendrogram was unaffected.

6. From Dendrogram to Cluster Profiles

Having obtained the dendrogram in Drawing 1, and 
recalling that the purpose here was to highlight the inter- 
actions among variables, it remained only to select a level 
of clustering, identify the partition of countries there, 
plot the distributions of variables within each cluster, and 
compare these plots. But several iterations of this process 
were required before the interactions among variables began 
to appear. 

The first attempt was at level k = 3. Here the
United States appeared alone in one cluster, and the other
two clusters contained 41 countries and 43 countries
respectively. Apparently the United States was alone because
of its extreme values in Variables 2 and 3. At this point
the United States was evaluated as an outlier and was not
included in subsequent comparisons. Within each of the other
two clusters, the mean and standard deviation of each of the
five variables were computed. After a short perusal of these
20 statistics it became apparent that no interactions among
variables were highlighted at this level. Within four of the
five variables, the two means were displaced from each other
by less than the sum of their standard deviations.
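The comparison just described (means displaced by more or less than the sum of the standard deviations) can be sketched as below. The function names `profile` and `well_separated` and the two-variable toy clusters are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was computed.

```python
from statistics import mean, stdev

def profile(rows):
    """Return [(mean, std), ...], one pair per variable, for one cluster.

    rows is a list of objects, each a list of attribute values.
    The sample standard deviation is assumed here.
    """
    columns = list(zip(*rows))
    return [(mean(col), stdev(col)) for col in columns]

def well_separated(profile_a, profile_b):
    """For each variable, report whether the two cluster means differ
    by more than the sum of the two standard deviations."""
    return [abs(mean_a - mean_b) > std_a + std_b
            for (mean_a, std_a), (mean_b, std_b) in zip(profile_a, profile_b)]

# Invented toy clusters with two variables each.
cluster_a = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
cluster_b = [[10.0, 15.0], [11.0, 25.0], [12.0, 20.0]]
separation = well_separated(profile(cluster_a), profile(cluster_b))
```

In the toy data the first variable separates the two clusters cleanly while the second does not, which mirrors the kind of variable-by-variable reading the text applies to the cluster profiles.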

For the second iteration, level k = 7 was chosen.
Again the pair-wise displacements between means were compared
to standard deviations. The standard deviations were definitely
smaller here than they had been at the k = 3 level: within-cluster
homogeneity had improved. Between-cluster heterogeneity
had improved to a lesser extent: many of the means
were well separated but several others were not.

For the third iteration, level k = 8 was chosen. Here
there were three clusters containing only one country each.
All three were dismissed as outliers, leaving five clusters
for further study. Within each of these five clusters, the
mean and standard deviation of each of the five variables
were computed, using a straightforward computer program.
These statistics are listed in Table II and displayed graphically
in Table III. After a relatively brief perusal of
Table III, several possible interactions among the variables
came to mind. Then a quick double-check of Table II confirmed
that four of those possible interactions were probable
interactions. These probable interactions are listed in
Table IV.

None of these "probable interactions" was verified
mathematically. The first one would have been relatively
easy to check out, by computing 10 correlation coefficients.
But the others would have required considerably more ingenuity.
Since the purpose here was to demonstrate a new application of
cluster analysis rather than to deduce substantive results,
mathematical verification was considered beyond the scope of
this thesis.
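The check mentioned above, one correlation coefficient for each of the 10 pairs that can be drawn from five variables, might look like the sketch below. It was not part of the thesis work. The function names `pearson` and `pairwise_correlations` are hypothetical, the toy rows are invented, and the Pearson coefficient is an assumed choice.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences
    (both assumed to be non-constant)."""
    mean_x, mean_y = mean(x), mean(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                            sum((b - mean_y) ** 2 for b in y))
    return numerator / denominator

def pairwise_correlations(data):
    """Map each variable pair (numbered from 1) to its correlation."""
    columns = list(zip(*data))
    return {(a + 1, b + 1): pearson(columns[a], columns[b])
            for a, b in combinations(range(len(columns)), 2)}

# Invented toy data: four objects, five variables.
toy_rows = [[1.0, 2.0, 5.0, 0.1, 3.0],
            [2.0, 4.0, 3.0, 0.4, 1.0],
            [3.0, 6.0, 4.0, 0.2, 2.0],
            [4.0, 8.0, 1.0, 0.9, 5.0]]
correlations = pairwise_correlations(toy_rows)
```

With five variables the dictionary holds exactly 10 entries, matching the count given in the text.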

However, there was a further step, within the scope
of this thesis, that might have been pursued but was not. It 
would have been logical to proceed next to another cluster 
level (perhaps k = 13) and again look for interactions. Such 
reiterations might well confirm or refine the interactions 
already deduced, and highlight additional interactions as well. 

B. SECOND APPLICATION: TO CLASSIFY COUNTRIES OBJECTIVELY 

1. General Description

In this application, the user presumes to understand 
the variables used, and the interactions among these variables, 
at least on a superficial level. The purpose here is not to 
research the variables, but rather to objectively classify the 
countries. The previous application (to highlight interactions
among variables) produced a dendrogram only as an intermediate
step before producing "cluster profiles" as the final product.
But the current application seeks only the dendrogram itself.

TABLE II CLUSTER PROFILES OUTPUT FROM HIGHLIGHTING DEMONSTRATION

TABLE III



CLUSTER PROFILES, OUTPUT FROM HIGHLIGHTING DEMONSTRATION, GRAPHED

[Graph: one horizontal axis per variable, with the numbered clusters marked along each axis.]
Variable 1. Concentration of Population in Cities (min = 0.0, max = 0.21)
Variable 2. Radios per 1000 Population (min = 5.4, max = 1233.5)
Variable 3. Students in Higher Education per Million Population (min = 6.0, max = 28400)
Variable 4. Ethno-Linguistic Fractionalization (min = 0.0, max = 0.926)
Variable 5. Press Freedom Index (min = -3.51, max = 3.06)



TABLE IV

RESULTS OF HIGHLIGHTING DEMONSTRATION:
PROBABLE INTERACTIONS AMONG VARIABLES
DEDUCED AT CLUSTER LEVEL k = 8



1. There appear to be no significant pair-wise correlations
(either positive or negative) among the five variables.

2. A high value in Variable 3 tends to be accompanied by a
high value in Variable 5. But the inverse and converse
are not true (i.e., a high value in Variable 5 does not
imply a high value in Variable 3, and a low value in
Variable 3 does not imply a low value in Variable 5).

3. A very high value in Variable 4 tends to be accompanied
by a low value in Variables 1, 2, and 3.

4. The combination of a high value in Variable 1 and a low
value in Variable 4 tends to be accompanied by a high
value in Variable 5.






For ready reference, the variable names are:

Variable 1. Concentration of Population in Cities, 1965
Variable 2. Radios per 1000 Population, 1965
Variable 3. Students in Higher Education (Third Level)
per One Million Population, 1965
Variable 4. Ethno-Linguistic Fractionalization
Variable 5. Press Freedom Index, 1965

to objectively confirm, or perhaps modify, the user's previous,
subjective, classifications.

2. Scenario for Demonstration

People commonly think about the countries of the world
as members of clusters. They use labels like "the Western
World", "the Communist Bloc", "the Haves" and "the Have-nots"
every day, and they frequently hear more esoteric terms like
"tri-polar world", "five-polar world" and "spheres of influence",
all of which have classificatory overtones.

No doubt such classifications are convenient and useful;
but as they exist now, many are also subjective and confusing.
When two speakers discuss the behavior of "the Communist Bloc"
without first enumerating the members of that bloc, they may
disagree violently until they discover that one of them includes
Cuba and Chile in his definition but excludes Yugoslavia, while
the other has done the reverse. If our classifications of
countries are useful but subjective, it would seem desirable
to make them more objective.

Imagine that a political scientist in the State
Department wished to inject some objectivity into the terms
"Western World", "Communist Bloc", "Soviet Bloc", etc. His
first step would probably be to identify the several theories
(form of government, internal economic system, external
political ties, external economic ties, etc.) that are commonly
used to define the terms in question. Then he would probably
select one of these theories for quantification and search out
measurable factors (preferably statistics that had already
been measured) with which to express it.

For example, imagine that he selected the theory that
external economic ties are the prime mover in the concept of
bloc membership. Then his search for measurable factors would
certainly lead to statistics such as level of foreign aid
received from every other country, value of imports received
from every other country, and value of exports sent to every
other country, each of these statistics prorated against the
host country's GNP and/or population.

The final step for our State Department researcher
would be to combine these measurable factors mathematically
so as to output a bloc membership label for each country of
the world; that is, he would write a "factors-to-bloc
transformation". If he were not acquainted with Cluster Analysis,
he might well try to write a single function of the form

\[ \text{bloc membership} = F(\text{aid}_{US}, \text{aid}_{USSR}, \ldots, \text{trade}_{US}, \text{trade}_{USSR}, \ldots, \text{GNP}, \text{population}, \text{etc.}) \]

where "bloc membership" is a discrete variable which can take
on three or perhaps five predetermined values. But such a
function would probably be crippled by two weaknesses: excessive
complexity and theoretical inadequacy. The reader can
certainly visualize how complicated such a function would have
to be in order to have broad applicability. Moreover, no
matter how complex the function, it would necessarily ignore
an obvious fact about blocs of countries: two countries can
be closely bound in a bloc not by economic dependency on each
other but by their simultaneous economic dependency on an
intermediate country.

Cluster Analysis has far more potential as a "factors-to-bloc"
transformation. It does not share the dual weaknesses
of the functional transformation. First of all, the ability
to group two countries together through an intermediary is
intrinsic to every clustering algorithm (so long as the
Complete Linkage-Furthest Neighbor sorting strategy is not used).
And secondly, Cluster Analysis requires that the user define
only a transformation from measurable factors to a pair-wise
Dissimilarity Coefficient rather than a transformation from
measurable factors to bloc membership. Surely the former
should be less complex than the latter.

3. Choice of Data

To demonstrate this application, the author again 
usurped the role of State Department political scientist and 
selected the following statistics as measurable factors with 
which to objectively classify countries into blocs: 

Variable 1. Gross National Product per Capita, 1965 

Variable 2. Trade as percentage of Gross National Product, 
1965 

Variable 3. Soviet Aid per Capita, 1954 - 1965

Variable 4. U.S. Economic Aid per Capita, 1958 - 1965

Values of these variables for each of 85 countries are listed
in Table V.

TABLE V - DATA INPUT TO OBJECTIVE CLASSIFICATION DEMONSTRATION

Here again the unavailability of data made the
demonstration artificial. As was asserted in the preceding 
section, the blocking of countries by external economic ties 
should depend primarily on pair-wise data. But this researcher 
did not have access to any standardized, comprehensive pair- 
wise data. Variables 3 and 4 above are pair-wise but not 
comprehensive. Foreign aid is provided in substantial amounts 
by countries other than the United States and the Soviet Union. 
But this researcher could not locate any but the most piece- 
meal data on other donors. Variable 2 above is not pair-wise 
at all. Pair-wise trade data is collected by the Interna- 
tional Monetary Fund, and their data is both standardized 
and reasonably comprehensive. But that data is not made 
available to the public in comprehensive form.¹ Without
pair-wise trade data it is virtually impossible to construct
a logical theory for blocking countries by external economic
ties. Nevertheless, this demonstration was carried through
to completion because its purpose is not to deduce substantive
results but to demonstrate a procedure. And that procedure
can still be demonstrated using the foreign aid data (Variables
3 and 4), which is pair-wise, although incomplete.

¹After completion of the research described here, the
author did obtain access to the IMF data and began a Cluster
Analysis on it. But the results were not obtained in time
to incorporate them in this thesis. See Section II.B.7 for
a description of the work in progress. The data was obtained
through the Inter-University Consortium for Political Research,
on computer tape. The reason why the data is not generally
available was obvious: its sheer magnitude. For purposes of
data collection, the IMF defines 207 countries, and 207
countries taken two at a time produce 21,321 trading combinations.
The complete data file contains almost 500,000 numbers.

4. Choice of Dissimilarity Coefficient

This section describes the process of converting the 
data in Table V to a matrix of Dissimilarity Coefficients. 

When Cluster Analysis was used to highlight the interactions
of variables (Section II.A.4 above), the choice of
DC was motivated by a desire to have all five variables
weighted equally, to prevent the user's preconceptions from
affecting the results. Precisely the opposite is true here.
Here the author presumed that he already knew how the
variables interact. He wanted to incorporate that knowledge
into the DC. The DC was constructed using the following
rationale:

First of all, it was decided that the DC between a 
foreign aid donor and any other country should be inversely 
related to the level of that foreign aid. Thus, for a first 
cut, the formulas 

\[ DC(US,i) = \frac{1}{1 + usaid_i} \quad \text{and} \quad DC(SOV,i) = \frac{1}{1 + sovaid_i} \]

were considered. Next it was observed that 27 of the 85
countries received foreign aid from both the United States
and the Soviet Union. To incorporate relative dependency
into the formulas, it was decided to insert a ratio of aid
levels. Hence the following formulas were considered:

\[ DC(US,i) = \frac{1 + sovaid_i}{1 + usaid_i} \quad \text{and} \quad DC(SOV,i) = \frac{1 + usaid_i}{1 + sovaid_i} \]



These formulas seemed reasonable except that the same level
of foreign aid has smaller impact on a rich country than it
does on a poor country. So it was decided to insert GNP_i as
a scaling factor wherever an aid term appeared in either
formula. But this insertion tended to greatly reduce the
size of the aid terms with respect to the "1" terms. Therefore
the "1" terms were arbitrarily reduced to "0.1",
producing



\[ DC(US,i) = \frac{0.1 + sovaid_i / GNP_i}{0.1 + usaid_i / GNP_i} \quad \text{and} \quad DC(SOV,i) = \frac{0.1 + usaid_i / GNP_i}{0.1 + sovaid_i / GNP_i} \]



The fact that DC(US,i) and DC(SOV,i) are reciprocals and the
fact that they are dimensionless had intuitive appeal. The
only apparent shortcomings were the two imposed by unavailability
of data: aid from other countries is ignored, and
pair-wise trade is ignored. Although a total trade figure
was available, there seemed to be no logical way to substitute
it for the missing pair-wise figure.
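The GNP-scaled donor formulas can be sketched as a pair of small functions. The function names, the `floor` parameter (standing for the arbitrary "0.1" term), and the sample figures are hypothetical. Aid and GNP are assumed to be per-capita amounts in the same units, so that the ratio is dimensionless as the text notes.

```python
def dc_us(usaid, sovaid, gnp, floor=0.1):
    """Dissimilarity between the United States and a recipient country.

    Falls below 1 as U.S. aid (relative to GNP) outweighs Soviet aid.
    """
    return (floor + sovaid / gnp) / (floor + usaid / gnp)

def dc_sov(usaid, sovaid, gnp, floor=0.1):
    """Dissimilarity between the Soviet Union and a recipient country.

    By construction this is the reciprocal of dc_us.
    """
    return (floor + usaid / gnp) / (floor + sovaid / gnp)

# Invented figures: a country receiving only U.S. aid.
us_side = dc_us(usaid=20.0, sovaid=0.0, gnp=100.0)
sov_side = dc_sov(usaid=20.0, sovaid=0.0, gnp=100.0)
# A country receiving no aid from either donor sits at 1.0 from both.
neutral = dc_us(usaid=0.0, sovaid=0.0, gnp=100.0)
```

In the invented case, a country receiving only U.S. aid lands closer to the United States (value below 1) and correspondingly farther from the Soviet Union (value above 1).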

At this point it was verified that the DC(US,i)
formula would apply to every country dyad in which the United
States is a member, except for the United States - Soviet Union
dyad. Similarly, the DC(SOV,i) formula applies to every
country dyad in which the Soviet Union is a member, except
for the United States - Soviet Union dyad. Thus it remained
to construct formulas for the United States - Soviet Union
dyad and for all dyads in which neither the United States nor
the Soviet Union is a member. Hopefully the same formula
would apply to both.

But here the lack of pair-wise trade data was really 
crippling. The only pair-wise economic ties of any signi- 
ficance involved pair-wise trade. The only logical formula 
necessarily involved the inverse of pair-wise trade. There 
seemed no natural way to use the available data on total 
trade. Finally, in desperation it was rationalized that a 
country whose foreign trade is large with respect to its GNP
tends to have closer ties with another country in the same 
situation. It was decided that the nucleus of the formula 
should be 

\[ DC(i,j) = | trade_i - trade_j | \]

But because of the weak theory here as compared to the 
rigorous formulas for DC(US,i) and DC(SOV,i) it was decided 
to diminish the effect of trade difference when foreign aid 
recipients are involved. Hence it was decided to expand the 
formula to 

[Formula illegible in the source scan. The recoverable terms are 
|trade_i - trade_j| and the aid sum 
sovaid_i + usaid_i + sovaid_j + usaid_j.] 

The dearth of theoretical foundation here is admitted. It 
casts suspicion on the values in the DC matrix and on the 
dendrogram finally obtained. But the reader is again reminded 
that the purpose here is to demonstrate the procedure, not to 
deduce substantive results. 

Turning from the substance to the procedure, there is 
a significant departure from normal Cluster Analysis procedure, 
taken above, that warrants explanation. Two completely differ- 
ent DC formulas have been developed, one to be used when the 
United States or Soviet Union is a dyad member and another 
to be used the rest of the time. The mathematical significance 
of this duality is that the DC formulas, taken collectively, 
produce gross violations of the metric inequality, which is 

$$DC(a,b) + DC(b,c) \geq DC(a,c) \quad \text{for all } a, b, c$$

Generally, it is desirable although not essential that a matrix 
of DCs satisfy the metric inequality. When they do not, the 
clustering algorithm can be expected to produce high 
"distortion" between the matrix of DCs and the dendrogram. 
(Loosely defined, "distortion" is the difference between 
DC(i,j) and the level at which country i and country j cluster 
together in an agglomerative algorithm.) But of what significance 
is high distortion? The word carries derogatory 
connotations, but is distortion really undesirable in Cluster 
Analysis? This author maintains that it depends on the purpose 
of the clustering. In the "highlighting of variables" application, 
distortion was not desirable: figuratively speaking, 
each country had been plotted in five-dimensional space and 
the clustering algorithm was searching for natural clusters, 
as plotted. But in this "objective classifying of countries" 
application, distortion is natural: the original pair-wise 
similarities specified in the DC matrix cannot be expected 
to be representable in Euclidean space, and during the 
clustering it is desired that these original similarities be 
affected by intermediate countries. With this reasoning, it 
is asserted that violation of the metric inequality is necessary 
and that the use of two or more DC formulas is acceptable. 
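
One way to see how far a DC matrix departs from the metric inequality is to enumerate the triples that violate it. The following is a minimal Python sketch, assuming the matrix is held as a list of lists; the function name and the sample values are hypothetical and are not taken from Table V.

```python
from itertools import permutations


def triangle_violations(dc):
    """Return every ordered triple (a, b, c) of distinct indices for which
    dc[a][b] + dc[b][c] < dc[a][c], i.e. the metric inequality fails."""
    n = len(dc)
    return [
        (a, b, c)
        for a, b, c in permutations(range(n), 3)
        if dc[a][b] + dc[b][c] < dc[a][c]
    ]


# Hypothetical symmetric DC matrix for three countries. Countries 0 and 2
# are far apart directly, yet both are close to country 1.
dc = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]

violations = triangle_violations(dc)
print(violations)
```

In this toy matrix the intermediate country 1 links countries 0 and 2 more closely than their direct coefficient suggests, which is the kind of effect the text describes as acceptable in the classifying application.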



Using the formulas developed above, the 340 data points 
in Table V were transformed to a matrix of Dissimilarity 
Coefficients. 

5. Choice of Algorithm 

Here, as in the highlighting demonstration, the first 
decision point was to choose among the agglomerative, divisive 
and reallocative algorithms. Again the divisive algorithms 
were discarded because they are not as well documented as 
their agglomerative counterparts. The reallocative algorithms 
did not apply because they require that the DC be a metric. 
Hence the agglomerative algorithm was selected. 

The final decision was to select a sorting strategy. 
The Complete Linkage-Furthest Neighbor strategy was eliminated 
from consideration here; it does not permit any chaining, 
which is desirable in this application. On the other end of 
the spectrum, the Single Linkage-Nearest Neighbor sorting 
strategy maximizes chaining, often to the extent that natural 
clusters are obscured. For this application it was determined 
to use one of the compromise sorting strategies. Among 
these, Group Average sorting seemed to correspond with the 
concepts of bloc membership, ratios of foreign aid, etc. It 
is expressed mathematically as 

$$DC(k,ij) = \frac{n_i}{n_i + n_j}\,DC(k,i) + \frac{n_j}{n_i + n_j}\,DC(k,j)$$

The dendrogram in Drawing 2 was obtained using the 
Group Average sorting strategy in an agglomerative algorithm. 
But once again the reader is cautioned that the results are 
suspect. 
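
The Group Average update can be sketched in a few lines of Python. The sketch assumes that the weights are the numbers of countries n_i and n_j already contained in the two clusters being merged, which is the usual Group Average convention; the function name and the numbers are illustrative only.

```python
def group_average_update(dc_ki, dc_kj, n_i, n_j):
    """Dissimilarity between cluster k and the newly merged cluster (i, j).

    dc_ki, dc_kj -- dissimilarities of k with clusters i and j before merging
    n_i, n_j     -- number of countries in clusters i and j (assumed meaning)
    """
    return (n_i * dc_ki + n_j * dc_kj) / (n_i + n_j)


# Hypothetical case: a three-country cluster i merges with a single country j.
merged = group_average_update(dc_ki=2.0, dc_kj=6.0, n_i=3, n_j=1)
print(merged)

# For comparison, the two extreme strategies discussed in the text:
complete_linkage = max(2.0, 6.0)  # Furthest Neighbor, no chaining
single_linkage = min(2.0, 6.0)    # Nearest Neighbor, maximum chaining
```

The Group Average value always lies between the Single Linkage and Complete Linkage values, which is why the text treats it as a compromise strategy.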

6. Explanation of Dendrogram 

Despite the admitted artificiality of the results 
obtained here, the layman might appreciate an explanation of 
the information available in any dendrogram produced by 
Cluster Analysis. 

The key to reading a dendrogram is the concept of 
"cluster level." By merely specifying a cluster level, the 
following information can be read from the dendrogram: the 
number of clusters and the countries contained in each cluster. 
That is, there is a correspondence from cluster level to a 
partition of the countries. 

The scale at the bottom of Drawing 2 is a cluster 
level scale. Note that the minimum value of cluster level is 
0.0 at the far left and the maximum value is 20.0 at the far 
right. A low cluster level specifies a partition having many 
small clusters, while a high cluster level specifies a partition 
having a few large clusters. Thus cluster level can be 
thought of as a measure of the largest dissimilarity (or, 
equivalently, the weakest bond) present within any cluster in 
the partition. 

For example, consider cluster level 0.0, the minimum 
observed cluster level in Drawing 2. At cluster level 0.0, 
the 85 countries are partitioned into 77 clusters. Seventy- 
two of these 77 contain only a single country. Four of the 77 
contain exactly two countries. And one cluster contains 5 
countries: Canada, Ireland, Switzerland, Sweden and Denmark. 

Since 0.0 is the minimum observed cluster level, we may conclude 
that the strongest possible bonds exist within every cluster. 
Specifically, we may conclude that Canada, Ireland, Switzerland, 
Sweden and Denmark are bound together by the tightest possible 
economic ties. Our mathematical model will not separate them 
even at the lowest cluster level. 

Consider next a slightly higher cluster level, say 1.3. 
Here we are permitting slightly weaker bonds to be present 
within clusters. We find that the 85 countries are here 
partitioned into 52 clusters. Thirty-three of those clusters 
contain a single country, twelve contain exactly two countries, 
five contain exactly three countries, one contains four 
countries, and one contains nine countries. In the nine-country 
cluster, Canada, Ireland, Switzerland, Sweden and Denmark have 
been joined by New Zealand, South Africa, France and Australia. 
We may conclude that slightly weaker economic ties bind the 
four new countries to the original five. 



Similar inferences can be drawn from any dendrogram 
produced by Cluster Analysis. 
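
The correspondence from cluster level to partition can be sketched in Python as follows. The sketch assumes the dendrogram is available as a list of merges, each recording the level at which two countries' clusters join. The function name, the short country list and the merge levels are all invented for illustration and are not read from Drawing 2.

```python
def partition_at(level, countries, merges):
    """Return the clusters present at the given cluster level.

    merges -- list of (cluster_level, country_a, country_b); each entry joins
              the clusters that currently contain country_a and country_b.
    """
    parent = {c: c for c in countries}

    def find(c):
        # Follow parent links to the representative of c's cluster.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for merge_level, a, b in merges:
        if merge_level <= level:
            parent[find(a)] = find(b)

    clusters = {}
    for c in countries:
        clusters.setdefault(find(c), []).append(c)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)


# Invented example: five countries and four merges.
countries = ["Canada", "Ireland", "Sweden", "France", "Chad"]
merges = [
    (0.0, "Canada", "Ireland"),
    (0.0, "Ireland", "Sweden"),
    (1.1, "France", "Canada"),
    (15.0, "Chad", "France"),
]

for level in (0.0, 1.3, 20.0):
    print(level, partition_at(level, countries, merges))
```

Raising the level can only join clusters, never split them, so the number of clusters falls as the level rises.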

7. Work in Progress 

Throughout this second demonstration of Cluster 
Analysis it has been emphasized that the unavailability of 
pair-wise trade data made the demonstration artificial. But 
this artificiality can soon be removed. Pair-wise trade data 
was recently provided to this author through the Inter-University 
Consortium for Political Research [Ref. 9]. Time 
will not permit this author to complete a Cluster Analysis 
on the data, but if another researcher chooses to undertake 
it, the following plan of attack is suggested. 

a. Step 1 - Reduce data file to manageable size 

The ICPR data file contains approximately 333,720 
pairwise trade data: annual trade values, in millions of 
U.S. dollars, for the years 1958 through 1968, among 207 different 
"countries." Many of these "countries" are actually 
colonies and many others have negligible foreign trade except 
with a single "sponsor country." The logical first step is 
to selectively reduce the size of the data file by eliminating 
the insignificant "countries", and by selecting a single year 
and eliminating the other nine. It is recommended that all 
"countries" be eliminated except the 136 nations having a 
population of one million or more and those smaller nations 
having membership in the United Nations as of 1968. These 
136 nations are listed on pages 1 through ^ of Ref. 8. It is 
further recommended that the year 1967 be used and the rest 
be eliminated temporarily. The author has determined that, 
through the first one-sixth of the file, 1967 has fewer zero 
entries than any other year (a zero entry signifies either 
trade less than 100,000 dollars or missing data). This 
selective reduction of the data should reduce the file length 
to about one eighth its original length. 
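
The reduction in Step 1 amounts to a single sequential pass over the file. The following Python sketch shows the idea; the record layout (a dictionary with "year", "exporter", "importer" and "value" keys), the country codes and the function name are assumptions made for illustration and do not describe the actual ICPR file format.

```python
def reduce_file(records, keep_countries, year=1967):
    """Yield only the records for the chosen year in which both the
    exporter and the importer belong to the retained set of nations."""
    for rec in records:
        if (rec["year"] == year
                and rec["exporter"] in keep_countries
                and rec["importer"] in keep_countries):
            yield rec


# Hypothetical records; values in millions of U.S. dollars.
sample = [
    {"year": 1967, "exporter": "USA", "importer": "FRA", "value": 10.0},
    {"year": 1966, "exporter": "USA", "importer": "FRA", "value": 9.0},
    {"year": 1967, "exporter": "USA", "importer": "XYZ", "value": 0.2},
]

kept = list(reduce_file(sample, keep_countries={"USA", "FRA"}))
print(kept)
```

Because the function is a generator that reads each record once, it would suit a sequentially read input.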

b. Step 2 - Sort and combine data 

Preparatory to sorting the data, the reduced data 
file should be stored on either a disk or a data cell rather 
than magnetic tape. The ICPR normally provides the data on 
tape, and tape is a satisfactory input to the data reduction 
process in step 1 because that process can be sequential, 
reading the file once from beginning to end. However, the 
sorting process about to be described cannot read the file 
sequentially, and magnetic tape is a very inefficient input 
to processes that must search the data. 

The ICPR data file does not list one trade figure 
per country dyad per year. It lists up to four figures, 
namely, 

1. Value of exports from i to j, as reported by i 
2. Value of exports from i to j, as reported by j 
3. Value of exports from j to i, as reported by j 
4. Value of exports from j to i, as reported by i 

Hopefully numbers 1 and 2 are approximately equal and numbers 
3 and 4 are approximately equal. If so, then total trade 
between i and j is the sum of 1 and 3. It is recommended 
that this approximate equality be assumed for the initial run 
of this "sort and combine" process. Then the process is 
simple: search the file for the first record involving the 
i-j dyad; identify it with respect to direction of trade, 
regardless of reporting country; continue searching for the 
second record involving the i-j dyad; identify it with respect 
to direction; if the directions are opposite then sum the two 
values and store the result; if the directions are the same then 
ignore the second value and continue searching for the third 
record; and so on. The reason why shortcuts are in order for 
the initial run is that this "sort and combine" process will 
have to be performed 9180 times (136 countries, taken two at 
a time, yields 9180 different combinations). 
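
The "sort and combine" rule can be sketched in Python. This sketch swaps the repeated file searches described above for a single pass that uses a dictionary keyed on direction of trade; that substitution is an assumption of the sketch and is not the procedure in the text. The tuple layout, the function name and the figures are hypothetical.

```python
from math import comb


def combine_trade(records):
    """records: iterable of (exporter, importer, reporter, value) tuples.

    Returns a dict mapping each unordered dyad to its total trade, using the
    first figure encountered for each direction and ignoring later duplicates.
    """
    first_seen = {}
    for exporter, importer, _reporter, value in records:
        # Same direction seen again (other reporter): keep the first value.
        first_seen.setdefault((exporter, importer), value)

    totals = {}
    for (exporter, importer), value in first_seen.items():
        dyad = tuple(sorted((exporter, importer)))
        totals[dyad] = totals.get(dyad, 0.0) + value
    return totals


# Hypothetical records, in millions of U.S. dollars.
sample = [
    ("A", "B", "A", 10.0),  # exports A to B, reported by A
    ("A", "B", "B", 11.0),  # same direction, reported by B: ignored
    ("B", "A", "B", 4.0),   # exports B to A, reported by B
    ("B", "A", "A", 5.0),   # same direction, reported by A: ignored
    ("A", "C", "A", 2.0),   # only one direction is on file
]

totals = combine_trade(sample)
n_dyads = comb(136, 2)  # number of distinct dyads among 136 nations
print(totals, n_dyads)
```

Unlike the procedure in the text, this sketch also keeps a dyad for which only one direction of trade is on file; whether that is appropriate would depend on how missing data are to be treated.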

c. Step 3 - Choose a Dissimilarity Coefficient 

The following formula is recommended as a DC, at 
least initially: 

$$DC(i,j) = \frac{1}{1 + \mathrm{Trade}_{ij}}$$

More elaborate formulas can be developed later by incorporating 
the rationale in Section II.B.4 of this thesis. 
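
Building the DC matrix from combined trade totals might look like the Python sketch below. It assumes the recommended coefficient has the inverse form 1/(1 + Trade_ij), in keeping with the earlier remark that a logical formula involves the inverse of pair-wise trade. The function name, the data layout and the figures are hypothetical.

```python
def dc_matrix(countries, trade):
    """Return a symmetric matrix of DC(i,j) = 1 / (1 + Trade_ij).

    countries -- list of country codes, fixing the row and column order
    trade     -- dict mapping a sorted (code, code) tuple to total trade;
                 a dyad absent from the dict is treated as zero trade
    """
    n = len(countries)
    dc = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            dyad = tuple(sorted((countries[a], countries[b])))
            value = 1.0 / (1.0 + trade.get(dyad, 0.0))
            dc[a][b] = dc[b][a] = value
    return dc


# Hypothetical totals, in millions of U.S. dollars.
countries = ["A", "B", "C"]
trade = {("A", "B"): 3.0}

dc = dc_matrix(countries, trade)
print(dc)
```

Under this form a dyad with no recorded trade receives the maximum coefficient of 1.0, and heavier trade drives the coefficient toward zero.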

d. Step 4 - Choose a Clustering Algorithm 

It is recommended that an agglomerative algorithm 
with Group Average sorting strategy be used, for the same 
reasons that it was selected in Section II.B.5 above. This 
involves making the following additions and substitutions in 
the computer program listed at the end of this thesis: 

Immediately before DO 73 E=1,N insert the following two 
statements: 

      RATA = S(A,A)/(S(A,A)+S(B,B)) 
      RATB = S(B,B)/(S(A,A)+S(B,B)) 

In place of 

      DS(E) = AMAX1(S(E,A),S(E,B)) 

substitute 

      DS(E) = RATA*S(E,A) + RATB*S(E,B) 

Similarly, in place of 

   70 DS(E) = AMAX1(S(A,E),S(B,E)) 

substitute 

   70 DS(E) = RATA*S(A,E) + RATB*S(B,E) 

And finally, in place of 

   71 DS(E) = AMAX1(S(E,A),S(B,E)) 

substitute 

   71 DS(E) = RATA*S(E,A) + RATB*S(B,E) 

III. CONCLUSION 



This thesis has demonstrated two potential uses of Cluster 
Analysis in which the nations of the world are treated as 
measurable objects. The substantive results obtained in each 
demonstration are not presented as conclusions; they were 
derived incidentally while demonstrating methods. It is 
asserted that the two uses illustrated here, markedly different 
in several respects, are representative of a wide range of 
applications for Cluster Analysis in the fields of political 
science and international relations. Although Cluster Analysis 
was developed for the physical sciences and has so far received 
scant attention outside that context, it is readily adaptable 
to the social sciences. In particular, it is extremely well 
suited to model building and statistical analysis involving 
the nations of the world. As such, it warrants the attention 
of the U.S. State Department. 

DATA 

Except for three data points, all data used in this thesis 
were made available by the Inter-University Consortium for 
Political Research. The data were originally collected by 
Charles Lewis Taylor and Michael C. Hudson. Neither the 
original collectors of the data nor the consortium bear any 
responsibility for the analysis or interpretations presented 
here. 

Following are the precise definitions of the nine variables 
used in this thesis. All definitions are extracted 
verbatim from Ref. 8. 

Variable name: Concentration of Population in Cities, 1965 

Definition: Concentration is defined as: the sum over all 
cities of the squares of the proportion of the total population 
residing in each city. Concentration is higher the fewer 
cities and the greater the size of the largest city relative 
to the total population. [Ref. 8, p. 16] 

Variable name: Radios per 1000 Population, 1965 

Definition: Figures relate to all types of receivers including 
those connected to a re-distribution system. They relate 
either to the number of licenses issued or sets declared or 
to the estimated number of receivers in use. In many countries 
a license may cover more than one receiver in the same household. 
Data exclude television sets. [Ref. 8, p. 32] 



Variable name: Students in Higher Education (Third Level) 
per One Million Population, 1965 

Definition: Data refer to the enrollment in all institutions 
of education at the third level, i.e., degree granting and 
non-degree granting institutions of both private and public 
higher education of all types. These include universities, 
higher technical schools, teacher training schools, theological 
schools, etc. As far as possible part time students are 
included in the figures but correspondence courses and auditors 
are generally excluded. [Ref. 8, p. 41] 

Variable name: Ethno-Linguistic Fractionalization 

Definition: The main source for this variable (Atlas Narodov 
Mira) makes little distinction between ethnic and linguistic 
differences in its definition and collection of data. Groups 
are determined not by their physical characteristics but by 
their roles, their descents and their relationships to others. 
An index of fractionalization calculated upon data from Atlas 
does correlate highly with a similar index calculated upon 
linguistic data from other sources, but not quite highly 
enough to be considered the same indicator. Other sources 
used here report only linguistic data. Index of fractionalization 
was calculated by the following formula: 

$$F = 1 - \sum_i \left(\frac{N_i}{N}\right)\left(\frac{N_i - 1}{N - 1}\right)$$

where N_i = number of people in the ith group 
and N = total population [Ref. 8, p. ^<69 
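
A worked instance of the fractionalization index may help. The Python sketch below assumes the formula is one minus the sum, over groups, of (N_i / N)(N_i - 1)/(N - 1), that is, one minus the probability that two people drawn without replacement belong to the same group. The function name and the group sizes are hypothetical.

```python
def fractionalization(group_sizes):
    """Index of fractionalization for a population split into groups.

    group_sizes -- number of people in each group (N_i); their sum is N.
    """
    total = sum(group_sizes)
    same_group = sum(
        (n / total) * ((n - 1) / (total - 1)) for n in group_sizes
    )
    return 1.0 - same_group


# Hypothetical populations of 100 people.
homogeneous = fractionalization([100])      # a single group
two_equal = fractionalization([50, 50])     # two groups of equal size
print(homogeneous, two_equal)
```

A population with a single group scores zero, and the index rises toward one as the population is divided among more groups of comparable size.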
Variable name: Press Freedom Index, 1965 

Definition: This index, created by the School of Journalism, 
University of Missouri, is "designed to measure the independence 
of a nation's broadcasting and press system and its 
ability to criticize its own local and national governments." 
The index is comprised of the judgements of panels of native 
and foreign newsmen on 23 aspects of the press (e.g., extent 
of legal controls, licensing, government ownership, criticism 
and censorship). For a fuller description, see Ralph L. 
Lowenstein, "PICA (Press Independence and Critical Ability) 
Index: Measuring World Press Freedom," University of Missouri, 
School of Journalism Freedom of Information Center Publication 
#166 (August, 1966). The index, which consists of averages 
of the judges' scores, has a range from -4.00 for less freedom 
to +4.00 for more. [Ref. 8, p. 116] 

Variable name: Gross National Product per Capita, 1965 

Definition: This variable was derived by dividing Gross 
National Product in millions of U.S. dollars by total population 
in thousands. Gross National Product is reported in 
constant U.S. dollars and refers to gross national product 
even for countries which normally report their national 
accounts in terms of net material product or other concepts. 
[Ref. 8, p. 65] 

Variable name: Trade as percentage of Gross National Product, 
1965 

Definition: This variable was derived by dividing total trade 
(imports plus exports, merchandise only) by Gross National 
Product. [Ref. 8, p. 69] 

Variable name: Soviet Aid per Capita, 1954 - 1965 

Definition: This variable was derived by dividing total 
Soviet aid by total population. Total Soviet aid data refer 
to Soviet economic credits and grants to countries in terms 
of thousand U.S. dollars for the period 1954/5 - 1965. 
[Ref. 8, p. 107] 

Variable name: U.S. Economic Aid per Capita, 1958 - 1965 

Definition: This variable was derived by dividing total 
U.S. economic aid by total population. Total U.S. economic 
aid data refer to grants and loans and are given in millions 
of U.S. dollars for the period July 1, 1958 through June 30, 
1965. [Ref. 8, p. 107] 

The three data points not provided by the ICPR are listed 
below. The ICPR data file listed all three as missing data. 
But in each case this author preferred to introduce an approximate 
(or even erroneous) value rather than eliminate the 
particular country from the Cluster Analysis. Hence the three 
values were estimated in the manner specified. Note that no 
two estimations involved the same country. All countries 
missing two or more data (among the nine variables used) in 
the ICPR data file were omitted from the Cluster Analysis at 
the outset. 

Country: Chile 
Variable name: Radios per 1000 Population, 1965 
Estimated value: 240.0 
Method of estimation: Average of values for Peru and 
Argentina. 

Country: Chad 
Variable name: Students in Higher Education (Third Level) per 
One Million Population, 1965 
Estimated value: 230.0 
Method of estimation: Average of values for Mali, Upper Volta, 
Sudan and Cameroon. 

Country: Zambia 
Variable name: Students in Higher Education (Third Level) per 
One Million Population, 1965 
Estimated value: 170.0 
Method of estimation: Average of seventeen neighboring 
countries. 